# Multimodal visual understanding
## Gemma 3 12b It Quantized.w8a8
An INT8 (w8a8) quantized version of google/gemma-3-12b-it that accepts image and text input and produces text output, suited to efficient inference deployment.
Image-to-Text
Transformers

Published by RedHatAI
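As a rough illustration of the deployment path such a checkpoint targets, the sketch below loads the INT8 weights with vLLM and sends a single image-plus-text prompt. The repo id, image URL, and generation settings are assumptions for illustration, not values taken from the listing.

```python
# Minimal vLLM sketch for an INT8 (w8a8) image-to-text checkpoint.
# Assumptions: the repo id below, a vLLM build with Gemma 3 multimodal
# support, and an example image URL; adjust all three for real use.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/gemma-3-12b-it-quantized.w8a8", max_model_len=4096)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=64))
print(outputs[0].outputs[0].text)
```
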
## Qwen2.5 VL 32B Instruct Exl2 4.25bpw
Apache-2.0
Qwen2.5-VL-32B-Instruct is the latest vision-language model in the Qwen family, with strong multimodal understanding and generation across images, video, and text; this release is an EXL2 quantization at 4.25 bits per weight.
Image-to-Text
Transformers English

Published by christopherthompson81
## Amoral Gemma3 12B Vision
A vision-enhanced version of soob3123/amoral-gemma3-12B that combines the Gemma3-12B language model with a vision encoder for multimodal tasks.
Image-to-Text
Transformers English

Published by gghfez
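For a Gemma3-based vision model like this, a typical query path is the Transformers "image-text-to-text" pipeline, sketched below. The repo id and image URL are placeholders inferred from the listing, not verified values.

```python
# Image-to-text with the Transformers "image-text-to-text" pipeline.
# The repo id is assumed from the listing name; swap in the actual one.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="gghfez/amoral-gemma3-12B-vision",  # assumed repo id
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "What is shown in this photo?"},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=64)
# With chat-style input the pipeline returns the full conversation;
# the last turn holds the model's reply.
print(result[0]["generated_text"][-1]["content"])
```
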
## Qwen2 VL 72B Instruct GGUF
Other
A GGUF quantized version of Qwen2-VL-72B-Instruct for multimodal image-and-text-to-text generation; it can be run with LlamaEdge.
Image-to-Text
Transformers English

Published by second-state
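LlamaEdge serves GGUF models behind an OpenAI-compatible API, so once the server is running the model can be queried with any OpenAI-style client, as in the sketch below. The port, served model name, and image URL here are assumptions, not values from the model card.

```python
# Query a locally running LlamaEdge API server (OpenAI-compatible endpoint).
# Assumptions: server on localhost:8080, model registered as
# "Qwen2-VL-72B-Instruct", and an example image URL.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen2-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize what this chart shows."},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```
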